Data Science with
DAY 2
It is desired to assess the effectiveness of a new exercise programme which is believed to induce weight loss and improve self-rated health. A group of 5000 people were selected and randomized into two groups: 2,500 people to the treatment group (which receives the exercise intervention) and another 2,500 to the control group (does not receive the exercise intervention). Health outcomes, weight and self-rated health, were measured before the start of the intervention and immediately after the intervention.
The data was provided in 4 files:
*EXER_age_sex_race.csv : Subject demographics at baseline: age, sex, and race (we have seen this data when learning how to import comma separated values files)
*EXER_SRH.csv : Self-Rated Health (SRH) on an ordinal scale.
*EXER_weight_trt : Weight, in pounds, for Treatment Group.
*EXER_weight_con : Weight, in pounds, for Control Group.
Let us import these files using the read_csv() function and explore the variables they contain.
data1 = read_csv("DataFiles/EXER_age_sex_race.csv")
## Parsed with column specification:
## cols(
## subject_ID = col_integer(),
## SexAge_Race = col_character()
## )
glimpse(data1)
## Observations: 5,000
## Variables: 2
## $ subject_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ SexAge_Race <chr> "MALE41.2_White", "FEMALE42.9_White", "FEMALE38.5_...
We can see that there are 5000 subjects in the study and all the information about each subject is in one column. So, we will need to separate this information. Also, we can see that there are missing values for race (people can opt out from disclosing this type of information).
data2 = read_csv("DataFiles/EXER_SRH.csv")
## Parsed with column specification:
## cols(
## id = col_integer(),
## trt = col_integer(),
## TIME = col_character(),
## SRH = col_character()
## )
glimpse(data2)
## Observations: 10,035
## Variables: 4
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
unique(as.factor(data2$trt))
## [1] 1 0
## Levels: 0 1
This file contains the patient id (note the different name from the previous file), each participant’s self rated health before and after the study. The column trt can take on two values: 0 (individual didn’t exercise), 1 (individual exercised). Notice that data2 has 10035 rows but there should only be 10000 rows. This needs to be investigated.
data3 = read_csv("DataFiles/EXER_weight_trt.csv")
## Parsed with column specification:
## cols(
## Id = col_integer(),
## PRE_WEIGHT = col_double(),
## POST_WEIGHT = col_double()
## )
glimpse(data3)
## Observations: 5,017
## Variables: 3
## $ Id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....
This data set contains information about the weight of patients who received the exercise plan. The patient id, again a different name than in the previous two data sets, pre-intervention weight and post-intervention weight. Note that many NAs are generated: if pre-intervention weight is recorded, post-intervention weight is an NA, and vice-versa. Also, we expect 5000 observations, instead there are 5017.
data4 = read_csv("DataFiles/EXER_weight_con.csv")
## Parsed with column specification:
## cols(
## obs_ID = col_integer(),
## PRE_WEIGHT = col_double(),
## POST_WEIGHT = col_double()
## )
glimpse(data4)
## Observations: 5,012
## Variables: 3
## $ obs_ID <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....
This is information about the patients in the control group (didn’t follow the exercise programme). Again, the patient id column has a different name in this file and pre- and post- intervention weight are recorded in similarly as in the file for the treatment group. We expect 5000 observations, instead there are 5012.
In this study an observational unit is a patient, fixed variables are age, sex and race. Measured variables are self-rated health and weight at two time points, before and after the intervention.
This is an example of messy data where many features of a messy data set can already be observed. Firstly a single observational unit is stored in multiple tables, multiple variables are stored in one column, there is possibly duplicated information, and more messy features to be discovered.
The strategy is to deal with the data sets one by one and manipulate each of them so that they can be joined. We will have to join by subject id so let us name that column id in all four data frames.
data1)glimpse(data1)
## Observations: 5,000
## Variables: 2
## $ subject_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ SexAge_Race <chr> "MALE41.2_White", "FEMALE42.9_White", "FEMALE38.5_...
data1_1 = data1 # data1_1 will be the modified version of data1. Keep data1 for reference.
data1_1$SexAge_Race = str_replace_all(data1$SexAge_Race, "MALE", "MALE_") # we add a _ after MALE to separate sex and age
data1_1 = separate(data1_1, SexAge_Race, c("Sex", "Age", "Race"), sep = "_") # we separate values between _
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ subject_ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FE...
## $ Age <chr> "41.2", "42.9", "38.5", "35.6", "48.5", "36.9", "28...
## $ Race <chr> "White", "White", "White", "Hispanic", "White", "NA...
data1_1$Age = as.numeric(data1_1$Age) # we make Age a numeric variable
names(data1_1)[1] = "id" # rename the first column
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", "NA", "Wh...
Now note that the Race column of data1_1 has the value “NA”, which R reads as a character string and not as a missing value. We need to transform the entries “NA” into NA.
aux = str_detect(data1_1$Race, "NA")
data1_1$Race[aux] = NA
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Whit...
summary(as.factor(data1_1$Race))
## Asian Black Hispanic White NA's
## 450 424 846 2876 404
summary(as.factor(data1_1$Sex))
## FEMALE MALE
## 2556 2444
summary(data1_1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.10 33.40 38.80 38.89 44.10 69.30
data2)Firstly, we must resolve the issue that there are 10,035 rows when we expect 10,000. We will first attempt to remove duplicate rows.
glimpse(data2)
## Observations: 10,035
## Variables: 4
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
data2_1 = data2#we keep data2 intact for reference. Changes will go into data2_1
data2_1 = distinct(data2)#remove duplicate rows
glimpse(data2_1)
## Observations: 10,016
## Variables: 4
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
After removing duplicate rows now we have 10,016 rows. We will obtain summaries of the columns to see if we can spot something unusual.
summary(as.factor(data2_1$trt))
## 0 1
## 5008 5008
summary(as.factor(data2_1$TIME))
## POST PRE
## 5003 5013
summary(as.factor(data2_1$SRH))
## Excellent Good Poor Satisfactory Very Poor
## 2947 1570 1711 1411 16
## Very Poor
## 2361
We spotted something unusual! The SRH level Very Poor appears also with a double space between Very and Poor. This happens for 16 rows.
data2_1$SRH = str_replace_all(data2_1$SRH, "Very Poor", "Very Poor")
glimpse(data2_1)
## Observations: 10,016
## Variables: 4
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
summary(as.factor(data2_1$SRH))
## Excellent Good Poor Satisfactory Very Poor
## 2947 1570 1711 1411 2377
data2_1 = distinct(data2_1)#remove duplicate rows
glimpse(data2_1)
## Observations: 10,000
## Variables: 4
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
We have two rows per patient. We will spread the TIME column, to create two new columns with PRE-SRH and POST-SRH. Remember that eventually we have to merge all four data sets into one. The first data set has one row per observation and we will follow that format.
data2_1 = spread(data2_1, TIME, SRH)
glimpse(data2_1)
## Observations: 5,000
## Variables: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ POST <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Excellent",...
## $ PRE <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good", "...
We change the names of the 3rd and 4th columns to POST_SRH and PRE_SRH and swap their order.
names(data2_1)[3:4] = c("POST_SRH", "PRE_SRH")
data2_1 = data2_1[,c(1,2,4,3)]
glimpse(data2_1)
## Observations: 5,000
## Variables: 4
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ trt <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PRE_SRH <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good...
## $ POST_SRH <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Excelle...
data3)Here we have the pre- and post-weights of the patients who received the exercise plan treatment. We will change the name of the first column to id, in line with the two previous data sets, keeping in mind we need to merge the data sets once they are in suitable form and tidied up.
glimpse(data3)
## Observations: 5,017
## Variables: 3
## $ Id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....
head(data3)
## # A tibble: 6 x 3
## Id PRE_WEIGHT POST_WEIGHT
## <int> <dbl> <dbl>
## 1 1 135. NA
## 2 1 NA 126.
## 3 2 155. NA
## 4 2 NA 154.
## 5 3 128. NA
## 6 3 NA 116.
data3_1 = data3
names(data3_1)[1] = "id"
There should be 5,000 rows in data3 but there are 5,017 rows. Note the very inefficient way the data is recorded. There should be just one row per patient. Instead there are two and lots of NAs.
data3_1 = distinct(data3_1)#remove duplicate rows
glimpse(data3_1)
## Observations: 5,017
## Variables: 3
## $ id <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....
There are no exact duplicate rows in data3. Let us see if there are more than two records per patient. The function count in the dplyr package can be used to count how many instances for each patient id there are in the data set. For example:
aux1 = count(data3_1[1:10,],id)#counts how many instances of each patient id there are in the first twenty rows
aux1
## # A tibble: 5 x 2
## id n
## <int> <int>
## 1 1 2
## 2 2 2
## 3 3 2
## 4 4 2
## 5 5 2
The result has two columns. One is id and the other one is the corresponding count, n.
aux1 = count(data3_1,id)#counts how many instances of each patient id there are in the whole data set
aux2 = which(aux1$n > 2)
aux3 = aux1$id[aux2]
aux3
## [1] 2394 2395 2396 2397 2398 2399 2400 2401 2402
aux3 gives the patient id’s with more than two entries. Let us explore the data for those patients.
ff = filter(data3_1, id %in% aux3)
as.data.frame(ff)
## id PRE_WEIGHT POST_WEIGHT
## 1 2394 137.0663 NA
## 2 2394 NA 128.2492
## 3 2394 NA 128.2000
## 4 2395 164.3000 NA
## 5 2395 164.3255 NA
## 6 2395 NA 164.3350
## 7 2395 NA 164.3500
## 8 2396 160.9983 NA
## 9 2396 160.9985 NA
## 10 2396 NA 165.7400
## 11 2396 NA 165.7270
## 12 2397 147.7983 NA
## 13 2397 147.7990 NA
## 14 2397 NA 137.8439
## 15 2397 NA 137.8500
## 16 2398 133.1364 NA
## 17 2398 133.1400 NA
## 18 2398 NA 118.9500
## 19 2398 NA 118.9398
## 20 2399 188.8684 NA
## 21 2399 188.9000 NA
## 22 2399 NA 179.1000
## 23 2399 NA 179.0564
## 24 2400 166.1632 NA
## 25 2400 166.2000 NA
## 26 2400 NA 150.2311
## 27 2400 NA 150.2000
## 28 2401 151.6768 NA
## 29 2401 151.7000 NA
## 30 2401 NA 133.3000
## 31 2401 NA 133.2508
## 32 2402 148.7000 NA
## 33 2402 148.7475 NA
## 34 2402 NA 136.2554
## 35 2402 NA 136.6000
So, we see that for these patients there are double entries, sometimes for the pre-weight, post-weight, or both. The strategy we will use to deal with this is to truncate the values of weight (pre- and post-intervention). Truncating a number means retaining the integer part of it only. So, for example, truncating 1.2 and 1.9 gives the same result, 1. This will hardly affect the results of the study. Truncating the weight data will create exact row duplicates and we will then eliminate duplicates. The function mutate_all() in the dplyr package is useful to apply a function to each column. In this case the function we want to apply is trunc().
data3_1 = mutate_all(data3_1, trunc)#this call truncates all the columns to the nearest integer
glimpse(data3_1)
## Observations: 5,017
## Variables: 3
## $ id <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT <dbl> 135, NA, 154, NA, 128, NA, 183, NA, 166, NA, 120, ...
## $ POST_WEIGHT <dbl> NA, 125, NA, 153, NA, 115, NA, 177, NA, 163, NA, 1...
Now, we remove duplicate rows:
data3_1 = distinct(data3_1)
glimpse(data3_1)
## Observations: 5,000
## Variables: 3
## $ id <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT <dbl> 135, NA, 154, NA, 128, NA, 183, NA, 166, NA, 120, ...
## $ POST_WEIGHT <dbl> NA, 125, NA, 153, NA, 115, NA, 177, NA, 163, NA, 1...
Now we have 5000 rows but we need to have one row per patient with their id, pre-weight and post-weight.
The strategy to achieve this is to select id and pre-weight, eliminate the rows where pre-weight is NA. Next select id and post-weight and eliminate the rows where post-weight value is NA. Next merge both data frames into one by patient id.
temp1 = select(data3_1, -POST_WEIGHT)#select all columns except POST_WEIGHT
temp1 = filter(temp1, is.na(PRE_WEIGHT) == "FALSE")#filter out the NAs in PRE_WEIGHT
glimpse(temp1)
## Observations: 2,500
## Variables: 2
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 1...
#
temp2 = select(data3_1, -PRE_WEIGHT)#select all columns except PRE_WEIGHT
temp2 = filter(temp2, is.na(POST_WEIGHT) == "FALSE")#filter out the NAs in POST_WEIGHT
glimpse(temp2)
## Observations: 2,500
## Variables: 2
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
#
data3_1 = inner_join(temp1,temp2)
## Joining, by = "id"
glimpse(data3_1)
## Observations: 2,500
## Variables: 3
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
data4)glimpse(data4)
## Observations: 5,012
## Variables: 3
## $ obs_ID <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....
head(data4)
## # A tibble: 6 x 3
## obs_ID PRE_WEIGHT POST_WEIGHT
## <int> <dbl> <dbl>
## 1 2501 160. NA
## 2 2501 NA 159.
## 3 2502 176. NA
## 4 2502 NA 175.
## 5 2503 181. NA
## 6 2503 NA 180.
data4_1 = data4
names(data4_1)[1] = "id"
There are 5,012 rows in this data set when there should be just 5,000. Similar inefficient way of recording information as in previous data set.
Let us first investigate if there are patient with duplicate records.
data4_1 = distinct(data4_1)
glimpse(data4_1)
## Observations: 5,012
## Variables: 3
## $ id <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....
There aren’t any duplicate rows.
Let us investigate if any patients have duplicate records.
aux1 = count(data4_1,id)#counts how many instances of each patient id there are in the whole data set
aux2 = which(aux1$n > 2)
aux3 = aux1$id[aux2]
aux3
## [1] 4980 4981 4982 4983 4984 4985
Patients 4980, 4981, 4982, 4983, 4984 and 4985 have more than two records each.
filter(data4_1, id %in% aux3)
## # A tibble: 24 x 3
## id PRE_WEIGHT POST_WEIGHT
## <int> <dbl> <dbl>
## 1 4980 152. NA
## 2 4980 152. NA
## 3 4980 NA 151.
## 4 4980 NA 151.
## 5 4981 171. NA
## 6 4981 171. NA
## 7 4981 NA 168.
## 8 4981 NA 168.
## 9 4982 155. NA
## 10 4982 155. NA
## # ... with 14 more rows
As before, we truncate the data and eliminate any duplicate rows.
data4_1 = mutate_all(data4_1,trunc)#this call truncates all the columns to 1 decimal place
data4_1 = distinct(data4_1)#eliminate duplicate rows
glimpse(data4_1)
## Observations: 5,000
## Variables: 3
## $ id <dbl> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT <dbl> 159, NA, 176, NA, 181, NA, 175, NA, 136, NA, 160, ...
## $ POST_WEIGHT <dbl> NA, 158, NA, 174, NA, 179, NA, 175, NA, 137, NA, 1...
We have 5,000 rows now.
We have to create a data frame with one row per patient, as before.
temp1 = select(data4_1, -POST_WEIGHT)#select all columns except POST_WEIGHT
temp1 = filter(temp1, is.na(PRE_WEIGHT) == "FALSE")#filter out the NAs in PRE_WEIGHT
glimpse(temp1)
## Observations: 2,500
## Variables: 2
## $ id <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 250...
## $ PRE_WEIGHT <dbl> 159, 176, 181, 175, 136, 160, 153, 163, 161, 178, 1...
#
temp2 = select(data4_1, -PRE_WEIGHT)#select all columns except PRE_WEIGHT
temp2 = filter(temp2, is.na(POST_WEIGHT) == "FALSE")#filter out the NAs in POST_WEIGHT
glimpse(temp2)
## Observations: 2,500
## Variables: 2
## $ id <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 25...
## $ POST_WEIGHT <dbl> 158, 174, 179, 175, 137, 159, 155, 164, 160, 177, ...
#
data4_1 = inner_join(temp1,temp2)
## Joining, by = "id"
glimpse(data4_1)
## Observations: 2,500
## Variables: 3
## $ id <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 25...
## $ PRE_WEIGHT <dbl> 159, 176, 181, 175, 136, 160, 153, 163, 161, 178, ...
## $ POST_WEIGHT <dbl> 158, 174, 179, 175, 137, 159, 155, 164, 160, 177, ...
Next we are going to join the rows of data3_1 and data4_1 to create data3_2 so that we have all patients in one data frame. First we are going to add a column in data3_1 called trt with all values equal to 1, and a column in data4_1 called trt as well but with all values equal to zero.
data3_1$trt = rep(1, nrow(data3_1))
data4_1$trt = rep(0, nrow(data4_1))
head(data3_1)
## # A tibble: 6 x 4
## id PRE_WEIGHT POST_WEIGHT trt
## <dbl> <dbl> <dbl> <dbl>
## 1 1 135 125 1
## 2 2 154 153 1
## 3 3 128 115 1
## 4 4 183 177 1
## 5 5 166 163 1
## 6 6 120 105 1
head(data4_1)
## # A tibble: 6 x 4
## id PRE_WEIGHT POST_WEIGHT trt
## <dbl> <dbl> <dbl> <dbl>
## 1 2501 159 158 0
## 2 2502 176 174 0
## 3 2503 181 179 0
## 4 2504 175 175 0
## 5 2505 136 137 0
## 6 2506 160 159 0
#
data3_2 = bind_rows(data3_1,data4_1)
glimpse(data3_2)
## Observations: 5,000
## Variables: 4
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ trt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
Now we must merge data1_1, data2_1 and data3_2.
data_exer1 = inner_join(data1_1, data2_1)
## Joining, by = "id"
data_exer1 = inner_join(data_exer1, data3_2)
## Joining, by = c("id", "trt")
glimpse(data_exer1)
## Observations: 5,000
## Variables: 9
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "F...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA...
## $ trt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PRE_SRH <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "G...
## $ POST_SRH <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Exce...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
We can further narrow the number of variables introducing a variable Time, with values PRE and POST, and gather SRH and Weight. To do that we will first create one data frame with all the fixed variables and pre- and post-SRH and another data frame with all the fixed variables and pre- and post-WEIGHT. We will gather the pre- and post- column into Time and SRH or WEIGHT and then we will join both data sets.
temp1 = data_exer1[,1:7]
temp1 = gather(temp1, "Time", "SRH", 6:7)
temp1$Time = str_replace_all(temp1$Time, "PRE_SRH", "PRE")
temp1$Time = str_replace_all(temp1$Time, "POST_SRH", "POST")
glimpse(temp1)
## Observations: 10,000
## Variables: 7
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Whit...
## $ trt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Time <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "...
## $ SRH <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good", "...
temp2 = data_exer1[,c(1:5,8,9)]
temp2 = gather(temp2, "Time", "WEIGHT", 6:7)
temp2$Time = str_replace_all(temp2$Time, "PRE_WEIGHT", "PRE")
temp2$Time = str_replace_all(temp2$Time, "POST_WEIGHT", "POST")
glimpse(temp2)
## Observations: 10,000
## Variables: 7
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Wh...
## $ trt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Time <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE",...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...
We now merge temp1 and temp2
data_exer2 = inner_join(temp1, temp2)
## Joining, by = c("id", "Sex", "Age", "Race", "trt", "Time")
glimpse(data_exer2)
## Observations: 10,000
## Variables: 8
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Wh...
## $ trt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Time <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE",...
## $ SRH <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good",...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...
Both data_exer1 and data_exer2 are tidy, and it is a matter of preference and how the data will be used to choose a format to work with.
We will work with both data frames for the purpose of visualising the data.
The variable SRH is an ordinal categorical variable. Let us tell R about the ordinal features of SRH. Also, Time is categorical ordinal. Sex and Race are simply categorical. Treatment, trt, is also a factor and we will let R know and re-label the levels from 0 to Control and 1 to Treatment. We will do this for both data frames data_exer1 and data_exer2.
#transform character columns into factors for data_exer1
data_exer1$PRE_SRH = factor(data_exer1$PRE_SRH, levels = c("Very Poor", "Poor", "Satisfactory", "Good", "Excellent"),ordered = TRUE)
data_exer1$POST_SRH = factor(data_exer1$POST_SRH, levels = c("Very Poor", "Poor", "Satisfactory", "Good", "Excellent"),ordered = TRUE)
data_exer1$Sex = factor(data_exer1$Sex, levels = c("FEMALE", "MALE"))
data_exer1$Race = as.factor(data_exer1$Race)
data_exer1$trt = factor(as.character(data_exer1$trt), levels = c("0", "1"), labels = c("Control", "Treatment"))
glimpse(data_exer1)
## Observations: 5,000
## Variables: 9
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH <ord> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
#transform character columns into factors for data_exer2
data_exer2$SRH = factor(data_exer2$SRH, levels = c("Very Poor", "Poor", "Satisfactory", "Good", "Excellent"),
ordered = TRUE)
data_exer2$Time = factor(data_exer2$Time, levels = c("PRE", "POST"), ordered = TRUE)#Time is also an ordered factor
data_exer2$Sex = factor(data_exer2$Sex, levels = c("FEMALE", "MALE"))
data_exer2$Race = as.factor(data_exer2$Race)
data_exer2$trt = factor(as.character(data_exer2$trt), levels = c("0", "1"), labels = c("Control", "Treatment"))
glimpse(data_exer2)
## Observations: 10,000
## Variables: 8
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, M...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, Hispan...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatment, ...
## $ Time <ord> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, ...
## $ SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent, ...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...
The variable Age is of a continuous nature. However, it is not really expected that at each small change in age we would observe changes in treatment effects. Also, in case effects change according to age, a change of, say 5 years, at young age will not possibly see the same effect as a change of 5 years at middle or old age. Therefore, we will consider age groups rather than a continuous range of age. Finding optimal age groups is in itself a topic that can be discussed at length. For the purposes of exploration, we will transform the age variable into categorical type with groups. Note that we will use Age from data_exer1 because it contains just one row per observational unit.
summary(data_exer1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.10 33.40 38.80 38.89 44.10 69.30
aux = cut(data_exer1$Age, c(18, 20, 30, 40, 50, 60,70))
hist(data_exer1$Age)
summary(aux)
## (18,20] (20,30] (30,40] (40,50] (50,60] (60,70]
## 33 649 2133 1779 374 32
We see that most of the participants are between 30 and 50 years of age. No participants are younger than 18 years or older than 70 years. We will split age into 18-34yrs, 35-40yrs, 41-45yrs, 46-70yrs, so that we don’t have an age category over-represented with number of patients. We create a new column in both data_exer1 and data_exer2 called Age_cat which indicates the age group of the patient.
data_exer1$Age_cat = cut(data_exer1$Age, c(18, 34, 39, 44, 70))
data_exer2$Age_cat = cut(data_exer2$Age, c(18, 34, 39, 44, 70))
summary(data_exer1$Age_cat)
## (18,34] (34,39] (39,44] (44,70]
## 1345 1223 1162 1270
Let us see how the data looks like now
glimpse(data_exer1)
## Observations: 5,000
## Variables: 10
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH <ord> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
glimpse(data_exer2)
## Observations: 10,000
## Variables: 9
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, ...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, ...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, Hispa...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatment,...
## $ Time <ord> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE,...
## $ SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent,...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149,...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,39], ...
Let us shuffle the columns so that Age and Age_cat are contiguous.
data_exer1 = data_exer1[,c(1:3,10,4:9)]
data_exer2 = data_exer2[,c(1:3,9,4:8)]
glimpse(data_exer1)
## Observations: 5,000
## Variables: 10
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH <ord> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
glimpse(data_exer2)
## Observations: 10,000
## Variables: 9
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, ...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, ...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,39], ...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, Hispa...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatment,...
## $ Time <ord> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE,...
## $ SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent,...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149,...
summary(data_exer2$trt)
## Control Treatment
## 5000 5000
Let us see a few ways in which we can tabulate some aspects of the data. We can use table(), ftable() (more than two variables) or xtabs().
with(data_exer2, table(SRH,Time))
## Time
## SRH PRE POST
## Very Poor 1321 1040
## Poor 959 752
## Satisfactory 611 800
## Good 736 834
## Excellent 1373 1574
Without considering any other factor, there are less patients in the Poor and Very Poor categories after the intervention, and there are more patients in the Satisfactory, Good and Excellent categories after the intervention than before the intervention.
Let us see if the trend stays when we consider also the patient’s gender.
aux2 = with(data_exer2, table(SRH,Time,Sex))
aux2
## , , Sex = FEMALE
##
## Time
## SRH PRE POST
## Very Poor 660 576
## Poor 443 414
## Satisfactory 337 367
## Good 391 382
## Excellent 725 817
##
## , , Sex = MALE
##
## Time
## SRH PRE POST
## Very Poor 661 464
## Poor 516 338
## Satisfactory 274 433
## Good 345 452
## Excellent 648 757
or alternatively,
aux3 = with(data_exer2, ftable(SRH,Time,Sex))
aux3
## Sex FEMALE MALE
## SRH Time
## Very Poor PRE 660 661
## POST 576 464
## Poor PRE 443 516
## POST 414 338
## Satisfactory PRE 337 274
## POST 367 433
## Good PRE 391 345
## POST 382 452
## Excellent PRE 725 648
## POST 817 757
or
aux4 = xtabs(~SRH + Time + Sex, data = data_exer2)
aux4
## , , Sex = FEMALE
##
## Time
## SRH PRE POST
## Very Poor 660 576
## Poor 443 414
## Satisfactory 337 367
## Good 391 382
## Excellent 725 817
##
## , , Sex = MALE
##
## Time
## SRH PRE POST
## Very Poor 661 464
## Poor 516 338
## Satisfactory 274 433
## Good 345 452
## Excellent 648 757
The trend of positive effect of the exercise plan is still present when we consider gender although it seems to be stronger for men than for women.
with(data_exer2, table(SRH, Time, Age_cat,trt))
## , , Age_cat = (18,34], trt = Control
##
## Time
## SRH PRE POST
## Very Poor 7 6
## Poor 0 0
## Satisfactory 35 47
## Good 25 25
## Excellent 623 612
##
## , , Age_cat = (34,39], trt = Control
##
## Time
## SRH PRE POST
## Very Poor 104 86
## Poor 78 99
## Satisfactory 127 127
## Good 224 222
## Excellent 76 75
##
## , , Age_cat = (39,44], trt = Control
##
## Time
## SRH PRE POST
## Very Poor 212 141
## Poor 155 166
## Satisfactory 102 165
## Good 96 93
## Excellent 0 0
##
## , , Age_cat = (44,70], trt = Control
##
## Time
## SRH PRE POST
## Very Poor 335 383
## Poor 230 220
## Satisfactory 51 20
## Good 20 13
## Excellent 0 0
##
## , , Age_cat = (18,34], trt = Treatment
##
## Time
## SRH PRE POST
## Very Poor 4 3
## Poor 0 0
## Satisfactory 35 0
## Good 10 0
## Excellent 606 652
##
## , , Age_cat = (34,39], trt = Treatment
##
## Time
## SRH PRE POST
## Very Poor 67 37
## Poor 108 46
## Satisfactory 140 12
## Good 231 284
## Excellent 68 235
##
## , , Age_cat = (39,44], trt = Treatment
##
## Time
## SRH PRE POST
## Very Poor 201 70
## Poor 170 96
## Satisfactory 121 270
## Good 105 161
## Excellent 0 0
##
## , , Age_cat = (44,70], trt = Treatment
##
## Time
## SRH PRE POST
## Very Poor 391 314
## Poor 218 125
## Satisfactory 0 159
## Good 25 36
## Excellent 0 0
ggplot(data = data_exer2) +
geom_mosaic(aes( x = product(SRH, Age_cat, Sex), fill = SRH), na.rm = TRUE) +
theme(axis.text.x=element_text(angle = -90, hjust = .1, size = 4)) +
facet_grid(trt~Time, drop = TRUE) +
scale_y_continuous("Age Group", breaks = c(0,0.25,0.5, 0.75)+0.25/2 , labels = c("18-34", "35-39", "40-44", "45-70")) +
labs(x = "Gender", title ='SRH by age and gender for treatment and control groups') +
guides(fill = guide_legend(title = "SRH", reverse = TRUE))+
theme(legend.position = "bottom")
ggplot(data = subset(data_exer2, is.na(Race) == "FALSE")) +
geom_mosaic(aes( x = product(SRH, Race, Sex), fill = SRH), na.rm = TRUE) +
theme(axis.text.x = element_text(angle = -90, hjust = .1, size = 4)) +
facet_grid(trt ~ Time) +
scale_y_continuous("Race", breaks = c(0.05,0.15,0.3, 0.65) , labels = c("Asian", "Black", "Hispanic", "White")) +
labs(x = "Gender", title = 'SRH by age and gender for treatment and control groups') +
guides(fill = guide_legend(title = "SRH", reverse = TRUE))+
theme(legend.position = "bottom")
Let us visualise now the measured variables that are continuous, namely Weight. We plot POST_WEIGHT vs PRE_WEIGHT for males (blue) and females (red), before (solid line) and after (dotted line) the intervention. We also add a 45 degree line through zero (symbolising no treatment effect on weight, i.e. weight before is equal to weight after).
ggplot(data_exer1, aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt)) +
geom_point(alpha = 0.2) +
stat_smooth(method = "loess", se = FALSE,lwd=1) +
geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1)
We observe that for both sexes the curves of the control patients are nearly overlapping the no-effect green line, as expected. The treatment curve for women is very near the no-effect green line. There seems to be a very mild positive effect for women whose weights before the intervention were below 150 pounds. The intervention seems to be effective for weight loss for men who underwent the exercise programme, as the blue dotted curve is clearly below the no-effect green line.
Let us split the data into age categories.
ggplot(data_exer1, aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt )) +
geom_point(alpha = 0.2) +
facet_grid(.~Age_cat) +
stat_smooth(method = "loess", se = FALSE, lwd = 1) +
geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1) +
theme(legend.position = "bottom")
The observations we made before still hold when we split the group by age category. The effect of the exercise programme seems to be more beneficial for men aged 45 or older who weigh over 200 pounds.
Now, let us split the group by race
ggplot(data = subset(data_exer1,is.na(Race) == "FALSE"), aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt )) +
geom_point(alpha = 0.2) +
facet_grid(.~Race) +
stat_smooth(method = "loess", se = FALSE, lwd = 1) +
geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1) +
theme(legend.position = "bottom")
We generally observe the same trends as before for women and for men although the treatment seems to be more effective for white men, specially for those who are overweight.
Define proportional weight loss as prWL = (POST_WEIGHT - PRE_WEIGHT)/PRE_WEIGHT, that is the proportion of weight lost relative to initial weight.Explore the distribution of this new variable conditional on the different subgroups. Values of prWL near zero are for little or no effect of the intervention, large negative values are for a positive effect for weight loss, as a percentage of the original weight.
Solution
data_exer1 = mutate(data_exer1, prWL = (POST_WEIGHT - PRE_WEIGHT)/PRE_WEIGHT)# compute and append a new column prWL
glimpse(data_exer1)
## Observations: 5,000
## Variables: 11
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
## $ Race <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH <ord> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH <ord> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ prWL <dbl> -0.074074074, -0.006493506, -0.101562500, -0.03278...
ggplot(data_exer1, aes(x = trt, y = prWL, col = Sex)) +
geom_boxplot()
For control patients there is little variation and, there is a small placebo effect in men. perhaps knowing that their weight was going to be monitored brought awareness to their diet and life style. For treatment patients we observe more variation in the proportional weight loss, with a marked more positive effect for men than for women. There are a few exceptional cases, where patients lose a large proportion of their initial weight, nearly 25% for some men, whereas other men gain weight, although not a large proportion of their initial weight. 50% of men lose about 9% or more of their initial weight. For women the effect is not as pronounced with a median proportional weight loss of about 2% and some exceptional cases gaining more than 10% of their original weight by the end of study (perhaps emotional eating due to anxiety of weight being monitored?).
Let us focus now only on treatment cases and see how the effectiveness of the intervention in terms of weight loss is related to the other variables.
ggplot(subset(data_exer1, is.na(Race) == "FALSE" & trt == "Treatment"), aes(x = Age_cat, y = prWL, col = Sex)) +
geom_boxplot() +
scale_x_discrete("Age group") +
scale_y_continuous("Weight lost as a proportion of initial weight") +
facet_grid(.~ Race) +
theme(legend.position = "bottom")
We can see in the previous plot, for example, that proportional weight loss seems to be more or less equally effective across age categories for white women, slightly more effective for older white, Asian and black men, but less effective for older Hispanic men. For black women the intervention seems to become more effective as age increases.
We conclude that the effect of the exercise programme seems to have a beneficial effect in both self-rated health and weight loss for many patients and that the effectiveness of the intervention is related to demographic characteristics such as age, sex and race.